-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collect metrics about the active/idle connections to ES nodes #141434
Collect metrics about the active/idle connections to ES nodes #141434
Conversation
Pinging @elastic/kibana-core (Team:Core) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments and a whole bunch of questions but I'll review again once CI goes green.
There will be some follow-up work to add these metrics to Cloud
|
||
expect(httpAgents.size).toEqual(2); | ||
expect(httpsAgents.size).toEqual(2); | ||
expect(httpAgents.has(agent1)).toEqual(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternative:
expect([...httpAgents]).toEqual(expect.arrayContaining([agent1, agent2]));
expect([...httpAgents]).not.toEqual(expect.arrayContaining([agent3, agent4]));
expect([...httpsAgents]).toEqual(expect.arrayContaining([agent3, agent4]));
expect([...httpsAgents]).not.toEqual(expect.arrayContaining([agent1, agent2]));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think that would work
expect.arrayContaining(array) matches a received array which contains **all** of the elements in the expected array
expect([...httpsAgents]).not.toEqual(expect.arrayContaining([agent1, agent2]));
would be passing for [agent1, agent3]
AFAIK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright! I'll simplify what can be simplified then.
const httpAgents = new Set<HttpAgent>([new HttpAgent(), new HttpAgent()]); | ||
const httpsAgents = new Set<HttpsAgent>([new HttpsAgent(), new HttpsAgent()]); | ||
|
||
AgentManagerMock.mockImplementationOnce(() => ({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the longer term, we might need a proper mock for AgentManager
but for now, it's probably ok to inline here.
} | ||
|
||
public reset() { | ||
// TODO check if we have to implement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd probably want to reset the collector on rolling restarts, to ensure we're not carrying over any stale metrics. Even if not needed, it's probably a good idea to reset the collector anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
client metrics are kinda like os/cgroups metrics though, in the way that the actual collector doesn't hold any data and just delegates to a lower-level actor, so I'm not sure to see what we could / should be cleaning here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That answer the question, thanks!
expect(getAgentsSocketsStats).toHaveBeenCalledTimes(2); | ||
expect(getAgentsSocketsStats).toHaveBeenNthCalledWith(1, httpAgents); | ||
expect(getAgentsSocketsStats).toHaveBeenNthCalledWith(2, httpsAgents); | ||
expect(metrics).toMatchInlineSnapshot(` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using snapshots in the metrics service has, in my experience, been a nightmare with flaky tests. We're already seeing failures for https://github.com/elastic/kibana/pull/141434/files#diff-3876c2ad50be4ead02241bceff9d5a9ee0d493a5fb53ff27c628cecc5d2716a6R72.
const agents = new Set<Agent>([new Agent(), new Agent()]); | ||
|
||
const stats = getAgentsSocketsStats(agents); | ||
expect(stats).toMatchInlineSnapshot(` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using snapshots has, in my experience, been flaky with the metrics service. We should be proactive and create test fixtures for these, similarly to how we test the process and event loop delays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall implementation looking good.
Just a bunch of remarks and NITs
expect(httpAgents.size).toEqual(2); | ||
expect(httpsAgents.size).toEqual(2); | ||
expect(httpAgents.has(agent1)).toEqual(true); | ||
expect(httpAgents.has(agent2)).toEqual(true); | ||
expect(httpAgents.has(agent3)).toEqual(false); | ||
expect(httpAgents.has(agent4)).toEqual(false); | ||
expect(httpsAgents.has(agent1)).toEqual(false); | ||
expect(httpsAgents.has(agent2)).toEqual(false); | ||
expect(httpsAgents.has(agent3)).toEqual(true); | ||
expect(httpsAgents.has(agent4)).toEqual(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: I think the order in these arrays are determined by the order of creation, right? So at this point, why just not check for exact array content?
expect(httpAgents).toEqual([agent1, agent2]);
expect(httpsAgents).toEqual([agent3, agent4]);
public getHttpAgents(): Set<NetworkAgent> { | ||
return this.httpStore; | ||
} | ||
|
||
public getHttpsAgents(): Set<NetworkAgent> { | ||
return this.httpsStore; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can return Set<HttpAgent>
for getHttpAgents
and Set<HttpsAgent>
for getHttpsAgents
, can't we? Would be a little more precise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N/A as I've merged both Agent types into a single set (commit coming soon).
@@ -12,6 +12,7 @@ import type { | |||
ElasticsearchServiceStart, | |||
ElasticsearchServiceSetup, | |||
} from '@kbn/core-elasticsearch-server'; | |||
import { AgentManager } from '@kbn/core-elasticsearch-client-server-internal'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: import type
@@ -94,6 +95,7 @@ const createInternalSetupContractMock = () => { | |||
level: ServiceStatusLevels.available, | |||
summary: 'Elasticsearch is available', | |||
}), | |||
agentManager: new AgentManager(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @TinaHeiligers's remark: we'd ideally want a proper mocked version here instead of delegating to the actual implementation.
Note that given it is only exposed on the internal contract / mock, the impact is not that significant.
jest.mock('@kbn/core-elasticsearch-client-server-internal'); | ||
jest.mock('./get_agents_sockets_stats'); | ||
|
||
const AgentManagerMock = AgentManager as jest.MockedClass<typeof AgentManager>; | ||
const getAgentsSocketsStatsMock = getAgentsSocketsStats as jest.MockedFunction< | ||
typeof getAgentsSocketsStats | ||
>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: For consistency, would be better to use the pattern of declaring a test's mocks in a [file].test.mocks.ts
file instead, and importing the mocked things from it, e.g packages/core/metrics/core-metrics-collectors-server-internal/src/os.test.mocks.ts
} | ||
|
||
public reset() { | ||
// TODO check if we have to implement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
client metrics are kinda like os/cgroups metrics though, in the way that the actual collector doesn't hold any data and just delegates to a lower-level actor, so I'm not sure to see what we could / should be cleaning here?
totalActiveSockets += sockets?.length ?? 0; | ||
nodesWithActiveSockets[node] = (nodesWithActiveSockets[node] ?? 0) + (sockets?.length ?? 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: I would set sockets?.length ?? 0;
into a variable to avoid repeating it (same with freeSockets?.length ?? 0
below).
this.metricsCollector = new OpsMetricsCollector( | ||
http.server, | ||
elasticsearchService.agentManager, | ||
{ | ||
logger: this.logger, | ||
...config.cGroupOverrides, | ||
} | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was gonna ask why you didnt add agentManager
to the second param, but looking at it, I see we're using the OpsMetricsCollectorOptions
both for OpsMetricsCollector
and OsMetricsCollector
construction.
Outside of the scope of this PR, but we should fix that, and have OpsMetricsCollector
use its own single structure / type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #141922 for it
describe('#start', () => { | ||
it('invokes setInterval with the configured interval', async () => { | ||
await metricsService.setup({ http: httpMock }); | ||
await metricsService.setup({ http: httpMock, elasticsearchService: esServiceMock }); | ||
await metricsService.start(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're actually missing a test to assert what parameters the OpsMetricsCollector
instance is created with during #setup
.
Optional given it wasn't here in the first place, but we could add one.
/** Number of HTTP Agents that have are currently being used by the ES-js client */ | ||
agents: number; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: can be http or https. would just remove this word from the description
export interface ElasticsearchClientsMetrics { | ||
/** Number of HTTP Agents that have are currently being used by the ES-js client */ | ||
agents: number; | ||
/** Number of ES instances (or proxies) that ES-js client is connecting to */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT:
/** Number of ES instances (or proxies) that ES-js client is connecting to */ | |
/** Number of ES nodes that ES-js client is connecting to */ |
Kibana doesn't support connecting through a proxy.
export interface ElasticsearchClientsMetricsByProtocol { | ||
http: ElasticsearchClientsMetrics; | ||
https: ElasticsearchClientsMetrics; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the value in having separate metrics for http and https?
Is it even possible to configure Kibana to use https and http nodes? Is it even possible to run an Elasticsearch cluster with mixed http / https nodes?
As a first step we just want to use this data to understand and optimize the default performance on cloud. But if this is useful to us it's useful to customers too. But if there's two keys then all the dashboards built on this data will have to have two queries and it will always show e.g. "http.averageActiveSocketsPerNode" and "https.averageActiveSocketsPerNode" even if only one will have data. So this kinda makes the dashboard noisy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's technically possible, at least from a configuration standpoint. When users specify the hosts:
property they could be introducing mixed http and https hosts.
I was thinking that we would mainly rely on "https" for dashboards, which will be the normal for our cloud deployments.
Then, if we find a faulty deployment with degraded performance, we could take a look and see if there are open sockets in the 'http' protocol? I'm not really sure how helpful it can be.
If you think it brings more noise than value I can simply remove the 'http' one, or try to merge them together into a single "store". Alternatively, from the Kibana metrics beat, we could simply consume the 'https' one. This way, the /api/stats
will continue to expose both (which could be useful for investigations on self-managed), and the monitoring cluster will only ingest the 'https' one, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can imagine it might be possible to configure mixed http/https nodes for Elasticsearch for high availability you might have mixed nodes while you switch nodes in your cluster one at a time to serving over https. But I suspect this is extremely rare as a long term configuration.
Mixed protocols means the elasticsearch.maxSockets
configuration option behaves differently and Kibana would have double as many sockets as a pure http or pure https configuration. In such a case only seeing the https sockets in a dashboard would be misleading and I think it would be more useful to see all sockets to all protocols on a dashboard.
So it feels like merging the agents would always provide a more holistic picture of how many sockets your Kibana is opening. I'm not sure if knowing if it's https or http would make us make any different decisions, in some sense it feels like an implementation detail, but we could have a field like protocol: 'http' | 'https' | 'mixed'
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point, I'll update my PR accordingly, thanks!
else if (http) protocol = 'http'; | ||
else protocol = 'none'; | ||
|
||
return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure where it happens but I tested this by creating a lot of idle connections and then getting /api/stats which returns:
...
"elasticsearch_client": {
"protocol": "https",
"connected_nodes": 1,
"nodes_with_active_sockets": 0,
"nodes_with_idle_sockets": 1,
"total_active_sockets": 0,
"total_idle_sockets": 4494,
"total_queued_requests": 0,
"most_active_node_sockets": null,
"average_active_sockets_per_node": null,
"most_idle_node_sockets": 4494,
"average_idle_sockets_per_node": 4494
},
...
We would probably use a long
mapping for these fields so then I think a value of null
(like for average_active_sockets_per_node) would cause an indexing error on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I'm adding some UT to cover these edge cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few remarks, but looking good to me
@@ -26,7 +26,7 @@ import { ScopedClusterClient } from './scoped_cluster_client'; | |||
import { getDefaultHeaders } from './headers'; | |||
import { createInternalErrorHandler, InternalUnauthorizedErrorHandler } from './retry_unauthorized'; | |||
import { createTransport } from './create_transport'; | |||
import { AgentManager } from './agent_manager'; | |||
import { AgentFactoryProvider } from './agent_manager'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: import type
@@ -12,7 +12,7 @@ import type { ElasticsearchClientConfig } from '@kbn/core-elasticsearch-server'; | |||
import { parseClientOptions } from './client_config'; | |||
import { instrumentEsQueryAndDeprecationLogger } from './log_query_and_deprecation'; | |||
import { createTransport } from './create_transport'; | |||
import { AgentManager } from './agent_manager'; | |||
import { AgentFactoryProvider } from './agent_manager'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: import type
* Side Public License, v 1. | ||
*/ | ||
|
||
import { AgentStore, NetworkAgent } from '@kbn/core-elasticsearch-client-server-internal'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: import type
constructor(private readonly agentStore: AgentStore) {} | ||
|
||
public async collect(): Promise<ElasticsearchClientsMetrics> { | ||
return getAgentsSocketsStats(this.agentStore.getAgents()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: return await getAgentsSocketsStats
for better stacktraces
jest.doMock('http'); | ||
const agent = new HttpAgent(); | ||
return Object.assign(agent, defaults); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we doing here exactly? what's the intent of the in-function doMock
call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My goal was to mock HttpAgent and allow overriding a few of its properties (mainly the lists of sockets
and freeSockets
). There's probably a better way to do it 😬
UPDATE: I need a "real" instance so that the code in getAgentsSocketsStats
that checks instanceof HttpsAgent
does not fail for the test.
UPDATE 2: I removed the doMock(..)
statements. They're unnecessary cause I don't need to mock any methods of the class afterwards.
let protocol: ElasticsearchClientProtocol; | ||
|
||
if (http && https) protocol = 'mixed'; | ||
else if (https) protocol = 'https'; | ||
else if (http) protocol = 'http'; | ||
else protocol = 'none'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: The dark side of the inlining is not strong in this one.
const protocol: ElasticsearchClientProtocol = http ? https ? 'mixed' : 'http' ? https ? 'https' : 'none';
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😮
💛 Build succeeded, but was flaky
Failed CI StepsTest Failures
Metrics [docs]Public APIs missing comments
Public APIs missing exports
Unknown metric groupsAPI count
History
To update your PR or re-run it, just comment with: |
…c#141434) * Collect metrics about the connections from esClient to ES nodes * Misc enhancements following PR remarks and comments * Fix UTs * Fix mock typings * Minimize API surface, fix mocks typings * Fix incomplete mocks * Fix renameed agentManager => agentStore in remaining UT * Cover edge cases for getAgentsSocketsStats() * Misc NIT enhancements * Revert incorrect import type statements
…c#141434) * Collect metrics about the connections from esClient to ES nodes * Misc enhancements following PR remarks and comments * Fix UTs * Fix mock typings * Minimize API surface, fix mocks typings * Fix incomplete mocks * Fix renameed agentManager => agentStore in remaining UT * Cover edge cases for getAgentsSocketsStats() * Misc NIT enhancements * Revert incorrect import type statements
[PR #141434](#141434) exposes a bunch of metrics related to the Elasticsearch Client in the `/api/stats` endpoint. While all these stats are interesting, some of them might be less relevant than others right now. Let's start by exposing only those stats that are more critical from a monitoring standpoint. <img width="440" alt="image" src="https://user-images.githubusercontent.com/25349407/201688243-4e33cd88-5fa2-48b7-b8ca-2fd175271adc.png">
As part of #134362, we want to have more visibility on the amount of open connections to ES nodes, for our different deployments.
With a cloud-first mindset, we are adding a new collector that will retrieve socket information from AgentManager, exposing this information on the
/api/stats
endpoint, so that it can be consumed by the Kibana Metric Beat, and sent over to our monitoring cluster.Here's an example of what the new properties look like (extracted from a local environment):